
REF: Check before calling DTA._from_sequence instead of catching #29589

Closed (6 commits)

Conversation

jbrockmendel
Member

Comment in core.apply is unrelated; it just doesn't merit its own PR.

Review thread on pandas/core/groupby/groupby.py, in _try_cast:

result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
result = result.astype(dtype)
except TypeError:
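
For context, the change named in the PR title replaces the try/except around DatetimeArray._from_sequence shown above with an explicit dtype check before the call. A minimal sketch of that pattern (editorial illustration, not the PR's actual diff), assuming result, arr, and dtype are the _try_cast locals from the snippet above:

# Editorial sketch only, not the PR's actual diff. Assumes `result`, `arr`,
# and `dtype` are the _try_cast locals from the snippet above.
from pandas.api.types import is_datetime64tz_dtype

if not is_datetime64tz_dtype(result.dtype):
    # result is not already tz-aware: localize to UTC first, then convert
    # to the column's original tz-aware dtype
    result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
    result = result.astype(dtype)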
Member

I suppose you can end up here with a different dtype that is not convertible to datetime64 (there is no guarantee that your groupby function or aggregation returns the same dtype as the original column). In such a case, we should not raise the error, but rather fall back to the original result as done now?

Member Author

AFAICT the relevant functions are all dtype-preserving (well, e.g. sometimes int can be cast to float, but that is not relevant for datetime64). Is there a case that I'm forgetting?

Member

What do you mean by "the relevant functions"? I think this gets used for user-defined functions as well?

Member Author

Most of the calls that get here go through our cython functions, but you're right that we can also get here from _python_agg_general. But in the case of user-defined functions that may not be dtype-preserving, we shouldn't be doing this casting at all, regardless of whether we pre-check vs. catch.

@jreback jreback added Clean Datetime Datetime data dtype labels Nov 13, 2019
@jreback jreback added this to the 1.0 milestone Nov 13, 2019
@jorisvandenbossche
Member

you're right that we can also get here from _python_agg_general. But in the case of user-defined functions that may not be dtype-preserving, we shouldn't be doing this casting at all, regardless of whether we pre-check vs. catch

Not fully sure if it should try this casting or not, but right now it does reach here, so your change has implications for that.

Consider this example:

In [5]: df = pd.DataFrame({'a': [0, 0, 1, 1], "b": pd.date_range("2012-01-01", periods=4, tz='UTC')}) 

In [6]: df.groupby('a')['b'].agg(lambda x: x.iloc[-1] - x.iloc[0])
Out[6]: 
a
0   1 days
1   1 days
Name: b, dtype: timedelta64[ns]

That will fail to convert to datetime, raise a TypeError and thus now fail with this branch.

Now, you are right that we maybe shouldn't try to cast (or at least only if the scalars are actual datetime objects), because if you write a function that returns integers, that now incorrectly casts back to datetime64:

In [3]: df.groupby('a')['b'].agg(lambda x: x.iloc[0].year)  
Out[3]: 
a
0   1970-01-01 00:00:00.000002012+00:00
1   1970-01-01 00:00:00.000002012+00:00
Name: b, dtype: datetime64[ns, UTC]
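
The odd-looking timestamps come from the integer 2012 being interpreted as 2012 nanoseconds after the Unix epoch once it is coerced to datetime64[ns]; the same effect can be reproduced outside groupby (illustration only):

import pandas as pd

# 2012 is read as 2012 nanoseconds after 1970-01-01 00:00:00 UTC,
# which matches the Out[3] values above.
pd.to_datetime([2012, 2012], unit="ns", utc=True)
# DatetimeIndex(['1970-01-01 00:00:00.000002012+00:00',
#                '1970-01-01 00:00:00.000002012+00:00'],
#               dtype='datetime64[ns, UTC]', freq=None)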

However, this case is a regression, as it worked correctly in pandas 0.23 (started to fail in 0.24, probably with the datetime arrays refactor).
(I seem to remember some discussion about this; it might be a reason for DatetimeArray._from_sequence to be stricter and only accept actual scalars of the array's type. But this is a digression here; I will open a new issue for it.)

@jorisvandenbossche
Member

About the "need" to cast here: e.g., this example would otherwise not have a tz-aware result, I think:

In [2]: df.groupby('a')['b'].agg(lambda x: x.iloc[0])                                                                                                                                         
Out[2]: 
a
0   2012-01-01 00:00:00+00:00
1   2012-01-03 00:00:00+00:00
Name: b, dtype: datetime64[ns, UTC]

@jbrockmendel
Member Author

@jorisvandenbossche first, I agree that this should be on hold until we sort out the desired when-to-try-casting behavior.

My read of your last two comments suggests that you agree with my previous comment that we should not be re-casting for UDFs?

@jreback
Contributor

jreback commented Nov 14, 2019

My read of your last two comments suggests that you agree with my previous comment that we should not be re-casting for UDFs?

I agree with this statement.

@jorisvandenbossche
Member

It would also be a regression to not cast (see the example in my last comment above; that would no longer output the correct dtype if there was no cast back to the original type).

So I think we should try some casting, but not in such a liberal way as done now (similar to the "soft conversion" in infer_objects, which will not cast integers to datetime64, but will cast datetime objects).
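
To illustrate that "soft conversion" distinction with infer_objects itself (a hedged aside, not the proposed groupby fix): datetime objects in an object column are picked up, while plain integers keep their dtype:

import pandas as pd
from datetime import datetime

# datetime objects stored as object dtype are soft-converted to datetime64
dt = pd.Series([datetime(2012, 1, 1), datetime(2012, 1, 3)], dtype=object)
print(dt.infer_objects().dtype)    # datetime64[ns]

# plain integers are never coerced to datetime64 by infer_objects
ints = pd.Series([2012, 2012], dtype=object)
print(ints.infer_objects().dtype)  # int64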

@jbrockmendel
Member Author

@jorisvandenbossche for non-UDFs the behavior in this PR (and master) is correct. For UDFs you have convinced me that the casting should not be done.

Does this accurately reflect your views?

@jorisvandenbossche
Member

Not really, in my last comment I said: "So I think we should try some casting, but not in such a liberal way as done now."

Now, "trying to cast" is always a Pandora's box (but we do that a lot in pandas in all kinds of cases); I am mainly noting that not doing it would give a regression in my example above (#29589 (comment)).

@jbrockmendel
Member Author

Locally, if I disable this casting exclusively for UDFs, both the df.groupby('a')['b'].agg(lambda x: x.iloc[0]) and df.groupby('a')['b'].agg(lambda x: x.iloc[0].year) work correctly.

"So I think we should try some casting, but not in such a liberal way as done now." is vague compared to "do this casting only for non-UDFs". Can you phrase your preferred approach more specifically?

@jorisvandenbossche
Member

Can you show the result of what you get locally?

@jbrockmendel
Member Author

Can you show the result of what you get locally?

After adding a cy=True kwarg to _try_cast and passing cy=False in the two _try_cast calls within _python_agg_general (all in core.groupby.groupby), which just disables the casting in question:

>>> df = pd.DataFrame({'a': [0, 0, 1, 1], "b": pd.date_range("2012-01-01", periods=4, tz='UTC')})

>>> df.groupby('a')['b'].agg(lambda x: x.iloc[0])
a
0   2012-01-01 00:00:00+00:00
1   2012-01-03 00:00:00+00:00
dtype: datetime64[ns, UTC]

>>> df.groupby('a')['b'].agg(lambda x: x.iloc[0].year)
a
0    2012
1    2012
dtype: int64

Does this match the expected/correct output?

@jorisvandenbossche
Member

Not sure what we do differently; what I tried was to just comment out the full casting block in _try_cast (but I might be missing complications of this at other locations):

diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index 280f1e88b..ab347b0bd 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -788,29 +788,29 @@ b  2""",
         else:
             dtype = obj.dtype
 
-        if not is_scalar(result):
-            if is_datetime64tz_dtype(dtype):
-                # GH 23683
-                # Prior results _may_ have been generated in UTC.
-                # Ensure we localize to UTC first before converting
-                # to the target timezone
-                arr = extract_array(obj)
-                try:
-                    result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
-                    result = result.astype(dtype)
-                except TypeError:
-                    # _try_cast was called at a point where the result
-                    # was already tz-aware
-                    pass
-            elif is_extension_array_dtype(dtype):
-                # The function can return something of any type, so check
-                # if the type is compatible with the calling EA.
-
-                # return the same type (Series) as our caller
-                cls = dtype.construct_array_type()
-                result = try_cast_to_ea(cls, result, dtype=dtype)
-            elif numeric_only and is_numeric_dtype(dtype) or not numeric_only:
-                result = maybe_downcast_to_dtype(result, dtype)
+        # if not is_scalar(result):
+        #     if is_datetime64tz_dtype(dtype):
+        #         # GH 23683
+        #         # Prior results _may_ have been generated in UTC.
+        #         # Ensure we localize to UTC first before converting
+        #         # to the target timezone
+        #         arr = extract_array(obj)
+        #         try:
+        #             result = arr._from_sequence(result, dtype="datetime64[ns, UTC]")
+        #             result = result.astype(dtype)
+        #         except TypeError:
+        #             # _try_cast was called at a point where the result
+        #             # was already tz-aware
+        #             pass
+        #     elif is_extension_array_dtype(dtype):
+        #         # The function can return something of any type, so check
+        #         # if the type is compatible with the calling EA.
+
+        #         # return the same type (Series) as our caller
+        #         cls = dtype.construct_array_type()
+        #         result = try_cast_to_ea(cls, result, dtype=dtype)
+        #     elif numeric_only and is_numeric_dtype(dtype) or not numeric_only:
+        #         result = maybe_downcast_to_dtype(result, dtype)
 
         return result

and then I get this (notice that the resulting dtype is not tz-aware here, in contrast with your example):

In [1]: df = pd.DataFrame({"a": [0, 0, 1, 1], 
   ...:                    "b": pd.date_range("2012-01-01", periods=4, tz='UTC')})

In [2]: df.groupby('a')['b'].agg(lambda x: x.iloc[0])
Out[2]: 
a
0   2012-01-01
1   2012-01-03
Name: b, dtype: datetime64[ns]

@jbrockmendel
Member Author

Not sure what we do differently; what I tried was to just comment out the full casting block

I just changed the if not is_datetime64tz_dtype(result.dtype): to if not is_datetime64tz_dtype(result.dtype) and not cy:

I'm now working on a new branch that should handle the cases you've mentioned correctly.

@jbrockmendel
Member Author

@jorisvandenbossche I'm working on adding the examples from here into #29641 and want to confirm the expected output. The results I posted here did not have a name for the Series; should the name be "b"?

@jbrockmendel
Member Author

Closing in favor of #29641

@jbrockmendel jbrockmendel deleted the faster-gb2 branch November 21, 2019 19:59
Labels: Clean, Datetime (Datetime data dtype)
3 participants